An Integrated Two-Stage Framework for Robust Head Pose Estimation
Abstract
Subspace analysis has been widely used for head pose estimation. However, such techniques are usually sensitive to data alignment and background noise. This paper proposes a two-stage approach that addresses this issue by combining subspace analysis with the topography method. The first stage is based on subspace analysis of Gabor wavelet responses. Different subspace techniques were compared to better explore the underlying data structure. Nearest prototype matching using Euclidean distance was used to obtain the pose estimate. The single estimated pose was then relaxed to a subset of poses around it to incorporate a certain tolerance to data alignment and background noise. In the second stage, the uncertainty is eliminated by analyzing finer geometrical structure details captured by bunch graphs. This coarse-to-fine framework was evaluated with a large data set. We examined 86 poses, with the pan angle spanning from −90° to 90° and the tilt angle spanning from −60° to 45°. The experimental results indicate that the integrated approach performs remarkably better than subspace analysis alone.

1 Motivation and Background

Head pose can be used for analyzing subjects’ focus of attention in "smart" environments [1][2][3]. Head pose is determined by the pan angle β and the tilt angle α, as shown in the right image of Fig. 1. For applications in driver assistance systems, the accuracy and robustness of the head pose estimation module are of critical importance [3]. Besides focus analysis, head pose estimation is also a very useful front-end processing step for multi-view human face analysis. An accurate pose estimate can provide the information needed to reconstruct the frontal view of the face for better facial expression recognition [4]. Pose estimation can also help select the best view model for detection and recognition [5][6].

Over the past several years, head pose estimation has been an active area of research.
If multiple images are available, the pose in 3D space can be recovered using the face geometry. The input can be video sequences [3][4][7][8] or multi-camera output [9][10]. The following techniques have been proposed: feature tracking, including tracking of local salient features [4][8] or geometric features [3][7]; and studying the joint statistical properties of image intensity and depth information [9][10].

With only static images available, 2D pose estimation presents a different challenge. Pose can be determined only in certain degrees of freedom (DOF), rather than in the full 6 DOF available in the 3D case. 2D pose estimation can be used as the front end for multi-view face analysis [5][11], as well as to provide the initial reference frame for 3D head pose tracking.

Fig. 1. Illustration of head pose estimation in focus analysis.

In [12], the author investigated the dissimilarity between poses by using specific filters such as Gabor filters and PCA. This study indicates that identity-independent pose can be discriminated by prototype matching with suitable filters. Several efforts have investigated the 2D pose estimation problem [5][6][11][13][14], focusing mainly on statistical learning techniques such as SVC in [5], KPCA in [11], multi-view eigen-space in [14], eigen-space from the best Gabor filter in [13], and manifold learning in [6]. All these algorithms are based on features from entire faces. Although the identity information can be well suppressed, one main drawback of such techniques is that they are sensitive to face alignment, background, and scale.

Some researchers have also explored the problem by utilizing the geometric structure constrained by representative local features [15, 16]. In [15], the authors extended the bunch graph work from [17] to pose estimation. The technique introduces the idea of incorporating the geometric configuration into 2D head pose estimation.
However, the study is based on only 5 well-separated poses. Poses not included can be categorized into these 5 poses by extensive elastic searching. Although this benefits the multi-view face recognition problem, it is not suitable for head pose estimation at a fine scale, since the elastic searching introduces ambiguity between similar poses. In [16], a Gabor wavelet network (GWN), constructed from the Gabor wavelets of local facial features, was used to estimate the head pose. One drawback is that it requires the selected facial features to be visible, and hence it is not suitable for head pose estimation with wide angle changes.

In this paper, our aim is to obtain a robust, identity-independent pose estimator over a wide range of angles. We propose a two-stage framework that combines statistical subspace analysis with geometric structure analysis for greater robustness. The main issue we want to solve is robustness to data alignment and background. More details are discussed below.

2 Algorithm Framework

The proposed solution is a two-stage scheme in a coarse-to-fine fashion. In the first stage, we use subspace analysis in a Gabor wavelet transform space. Our study indicates that statistical subspace analysis alone is insufficient to deal with data misalignment and background noise; however, the noise does not drive the estimate far from its true value. Therefore, we can assume with high confidence that the true pose lies in a subset of p × p neighboring poses around the estimate. We use this subset of poses as the output of the first stage. This is similar to a fuzzy decision. The first-stage accuracy is evaluated accordingly: if the true pose lies in the p × p subset around the estimate, the estimate is counted as correct.
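The first-stage relaxation and its accuracy criterion can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that poses are indexed on a regular (tilt, pan) grid and that p is odd, so the subset is centred on the estimate. The function names and the grid dimensions in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple

Pose = Tuple[int, int]  # (tilt index, pan index) on an assumed regular pose grid


def neighborhood(est: Pose, p: int, n_tilt: int, n_pan: int) -> List[Pose]:
    """Return the p x p block of grid poses centred on est, clipped to the grid."""
    if p < 1 or p % 2 == 0:
        raise ValueError("p is assumed to be a positive odd integer")
    r = p // 2
    ti, pj = est
    return [
        (i, j)
        for i in range(max(0, ti - r), min(n_tilt, ti + r + 1))
        for j in range(max(0, pj - r), min(n_pan, pj + r + 1))
    ]


def first_stage_correct(est: Pose, true: Pose, p: int) -> bool:
    """An estimate counts as correct if the true pose lies in its p x p subset."""
    r = p // 2
    return abs(est[0] - true[0]) <= r and abs(est[1] - true[1]) <= r


def first_stage_accuracy(
    estimates: Sequence[Pose], truths: Sequence[Pose], p: int
) -> float:
    """Fraction of estimates whose p x p subset contains the true pose."""
    hits = sum(first_stage_correct(e, t, p) for e, t in zip(estimates, truths))
    return hits / len(estimates)
```

For example, `neighborhood((3, 4), 3, 7, 13)` returns the nine candidate poses that a second stage would have to disambiguate; the 7 × 13 grid size is chosen only for illustration and is not taken from the paper.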
Since the geometric structure of local facial features can provide the detail needed for a finer pose assessment, in the second stage we use a structural landmark analysis in the transform domain to refine the estimate. More specifically, we use a revised version of the face bunch graph [17]. The diagrams in Fig. 2 outline this algorithm.

To get a comprehensive view of the underlying data structure, we study four popular subspaces so that the best subspace descriptors can be found: Principal Component Analysis (PCA) [18]; Kernel Principal Component Analysis (KPCA) [19]; Multiple class Discriminant Analysis (MDA) [18]; and Kernel Discriminant Analysis (KDA) [20, 21]. Results show that analysis in the kernel space provides better performance. Also, discriminant analysis is slightly better than PCA (please refer to Table 1).

To refine the estimate from the first stage, a semi-rigid bunch graph is used. Unlike the face recognition task solved in [17], we only need to recover the identity-independent head pose. In [17], an exhaustive elastic graph search is used to find the fiducial points that contain the subjects’ identity. However, the distortion in the geometric structure caused by the exhaustive elastic search would introduce ambiguity between close poses. Furthermore, for pose estimation we do not require an exact match of the fiducial points, since the nodes from Gabor jets are able to describe the neighborhood property. That is why we use the "semi-rigid" bunch graph, in which the nodes can only be adjusted individually and locally within legitimate geometrical configurations. We use multiple bunch graphs per pose to incorporate all available geometric structures. The reason is that the geometric structure captured by a single model graph is not subject-independent. Simple averaging is not sufficient to describe all subjects.
Since the first-stage estimation restricts the possible candidates to a small subset, the computational cost is still reasonable. The data span pan angles from −90° to 90° and tilt angles from −60° (head tilted down) to 45° (head tilted up). 86 poses are included, as shown in Fig. 3.

Fig. 2. Flowchart of the two-stage pose estimation framework. The top diagram is for the first-stage estimation and the bottom one is for the second-stage refinement. The output of the first stage is the input of the second stage.

3 Stage 1: Multi-resolution Subspace Analysis

The Gabor wavelet transform is a convolution of the image with a family of Gabor kernels. All Gabor kernels are generated from a mother wavelet by dilations and rotations. Gabor wavelets provide a good joint spatial-frequency representation. The DC-free version of the Gabor wavelets can suppress undesired variations, such as illumination change. Also, optimal wavelets can ideally extract the position and orientation of both global and local features [22]. Only magnitude responses are used in our algorithm, since the phase response is too sensitive.

3.1 Subspace projection

The wavelet features suffer from high dimensionality, and no discriminant information is extracted. Subspace projection is used to reduce the dimensionality as well as to extract the most representative information. In this paper, we compare four popular subspaces to better discover the underlying data structure: PCA, MDA, and their nonlinear counterparts.

Fig. 3. Examples of the image set. The top two poses are not discussed because there are not enough samples.

For clarity of presentation, in the following sections the data set is denoted as $\{x_i\}_{i=1,\dots,N}$ with $C$ classes. Samples from the $c$-th class are denoted as $x_{c,i}$, $i = 1, \dots, N_c$, where $N = \sum_{c=1}^{C} N_c$ and $\{x_i\}_{i=1,\dots,N} = \bigcup_{c=1}^{C} \{x_{c,j}\}_{j=1,\dots,N_c}$.

Linear subspace projection. PCA aims to find the subspace that describes the most variance while suppressing known noise as much as possible.
PCA subspace is spanned by the principal eigenvectors of the covariance matrix, which is:
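The excerpt ends before the covariance formula is given. As a minimal sketch, and not the authors' exact formulation, the code below assumes the standard sample covariance with 1/N normalisation, keeps the k leading eigenvectors as the PCA basis, and then applies the nearest prototype matching with Euclidean distance mentioned in the abstract. Using the class mean in the subspace as the prototype of each pose is an assumption made here for illustration; all function names are hypothetical.

```python
import numpy as np


def fit_pca(X: np.ndarray, k: int):
    """Fit a k-dimensional PCA basis to X of shape (N, D), one sample per row."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]            # assumed 1/N sample covariance, (D, D)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    return mean, eigvecs[:, order]          # basis W has shape (D, k)


def project(X: np.ndarray, mean: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the PCA subspace."""
    return (X - mean) @ W


def class_prototypes(Z: np.ndarray, labels: np.ndarray):
    """Assumed prototype per pose class: the mean of its projected samples."""
    classes = np.unique(labels)
    protos = np.stack([Z[labels == c].mean(axis=0) for c in classes])
    return classes, protos


def nearest_prototype(z: np.ndarray, classes: np.ndarray, protos: np.ndarray):
    """Return the class whose prototype is closest to z in Euclidean distance."""
    distances = np.linalg.norm(protos - z, axis=1)
    return classes[np.argmin(distances)]
```

In this sketch the rows of `X` would hold the Gabor magnitude features of the training images and `labels` their pose classes. For high-dimensional features, a practical implementation would typically work with the N × N Gram matrix or an SVD rather than forming the D × D covariance explicitly.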